Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EFM] Recoverable Random Beacon State Machine #6771

Open
wants to merge 35 commits into
base: feature/efm-recovery
Choose a base branch
from

Conversation

durkmurder
Copy link
Member

@durkmurder durkmurder commented Dec 2, 2024

#6725

Context

The goal of this PR is to implement a state machine which allows to cover all possible cases which can happen when performing DKG. Previously our implementation was enforcing some of the rules but not all of them, what made things worse is that logic and interfaces wasn't explicit enough so an engineer can easily understand it. In this PR we have made an attempt to implement a robust approach to handling possible state and state transitions for both happy path and recovery path.

Some key points:

  • badger.RecoverablePrivateBeaconKeyStateMachine implements the state machine itself.
  • State machine transitions are initiated from dkg.ReactorEngine and dkg.BeaconKeyRecovery.
  • State machine doesn't care about origin of the key. It could be obtained from different sources(successful DKG or manually injected by operator) all that matters that key is committed. Caller needs to ensure that respective public key is part of the EpochCommit for respective epoch.

Link to the diagram which describes the state machine:

@codecov-commenter
Copy link

codecov-commenter commented Dec 2, 2024

Codecov Report

Attention: Patch coverage is 1.88679% with 52 lines in your changes missing coverage. Please review.

Project coverage is 43.40%. Comparing base (7c71c41) to head (6ea64d6).

Files with missing lines Patch % Lines
cmd/consensus/main.go 0.00% 38 Missing ⚠️
model/flow/dkg.go 0.00% 10 Missing ⚠️
cmd/util/cmd/common/node_info.go 0.00% 3 Missing ⚠️
engine/common/grpc/forwarder/forwarder.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@                   Coverage Diff                    @@
##           feature/efm-recovery    #6771      +/-   ##
========================================================
+ Coverage                 41.71%   43.40%   +1.69%     
========================================================
  Files                      2030     1529     -501     
  Lines                    180459   141392   -39067     
========================================================
- Hits                      75285    61378   -13907     
+ Misses                    98978    75015   -23963     
+ Partials                   6196     4999    -1197     
Flag Coverage Δ
unittests 43.40% <1.88%> (+1.69%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@durkmurder durkmurder marked this pull request as ready for review December 3, 2024 18:43
@@ -154,24 +154,24 @@ func (e *ReactorEngine) startDKGForEpoch(currentEpochCounter uint64, first *flow
Hex("first_block_id", firstID[:]). // id of first block in EpochSetup phase
Logger()

// if we have started the dkg for this epoch already, exit
// if we have dkgState the dkg for this epoch already, exit
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// if we have dkgState the dkg for this epoch already, exit
// if we have started the dkg for this epoch already, exit

There are a few other instances of this replacement in this file. Assuming it's a rename gone wrong.

@@ -427,10 +431,10 @@ func (e *ReactorEngine) end(nextEpochCounter uint64) func() error {
// has already abandoned the happy path, because on the happy path the ReactorEngine is the only writer.
// Then this function just stops and returns without error.
e.log.Warn().Err(err).Msgf("node %s with index %d failed DKG locally", e.me.NodeID(), e.controller.GetIndex())
err := e.dkgState.SetDKGEndState(nextEpochCounter, flow.DKGEndStateDKGFailure)
err := e.dkgState.SetDKGState(nextEpochCounter, flow.DKGStateFailure)
if err != nil {
if errors.Is(err, storage.ErrAlreadyExists) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SetDKGState doesn't return this error type, it returns InvalidTransitionRandomBeaconStateMachineErr.

The comment above is outdated now, but it implies that we might already have persisted a failure state:

By convention, if we are leaving the happy path, we want to persist the first failure symptom

The state machine does not allow transitions from one state to itself (except for RandomBeaconKeyCommitted). If, as the current comment suggests, a failure state is already set at this point, we will throw InvalidTransitionRandomBeaconStateMachineErr as DKGStateFailure-> DKGStateFailure is not a valid transition. I don't think this is the case, but I'll return to this after reading further through the PR.

Suggestions:

  • update comment on lines 428-432
  • remove ErrAlreadyExists check

Comment on lines +23 to +24
// Despite the key origin this is a terminal state which defines a safe Random Beacon key for the next epoch and allow node
// to participate in the Random Beacon protocol.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Despite the key origin this is a terminal state which defines a safe Random Beacon key for the next epoch and allow node
// to participate in the Random Beacon protocol.
// Regardless of the key origin, this is a terminal state which defines a safe Random Beacon key for the next epoch and allows the node
// to participate in the Random Beacon protocol.

Comment on lines +23 to +24
flow.DKGStateStarted: {flow.DKGStateCompleted, flow.DKGStateFailure, flow.RandomBeaconKeyCommitted},
flow.DKGStateCompleted: {flow.RandomBeaconKeyCommitted, flow.DKGStateFailure, flow.RandomBeaconKeyCommitted},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
flow.DKGStateStarted: {flow.DKGStateCompleted, flow.DKGStateFailure, flow.RandomBeaconKeyCommitted},
flow.DKGStateCompleted: {flow.RandomBeaconKeyCommitted, flow.DKGStateFailure, flow.RandomBeaconKeyCommitted},
flow.DKGStateStarted: {flow.DKGStateCompleted, flow.DKGStateFailure, flow.RandomBeaconKeyCommitted},
flow.DKGStateCompleted: {flow.RandomBeaconKeyCommitted, flow.DKGStateFailure},

var allowedStateTransitions = map[flow.DKGState][]flow.DKGState{
flow.DKGStateStarted: {flow.DKGStateCompleted, flow.DKGStateFailure, flow.RandomBeaconKeyCommitted},
flow.DKGStateCompleted: {flow.RandomBeaconKeyCommitted, flow.DKGStateFailure, flow.RandomBeaconKeyCommitted},
flow.RandomBeaconKeyCommitted: {flow.RandomBeaconKeyCommitted},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we permit self-transitions for only this state?

RetrieveMyBeaconPrivateKey(epochCounter uint64) (key crypto.PrivateKey, safe bool, err error)
}

// DKGStateReader is a ready-only interface for reading state of the Random Beacon Recoverable State Machine.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// DKGStateReader is a ready-only interface for reading state of the Random Beacon Recoverable State Machine.
// DKGStateReader is a read-only interface for reading state of the Random Beacon Recoverable State Machine.

// No errors are expected during normal operations.
UpsertMyBeaconPrivateKey(epochCounter uint64, key crypto.PrivateKey) error
}

// InvalidTransitionRandomBeaconStateMachineErr is a sentinel error that is returned in case an invalid state transition is attempted.
type InvalidTransitionRandomBeaconStateMachineErr struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
type InvalidTransitionRandomBeaconStateMachineErr struct {
type InvalidDKGStateTransitionError struct {

The states are called flow.DKGState: I think we can use the more concise name here.

Minor nitpick on error naming convention: ErrXYZ for sentinels, XYZError for types.

err = store.SetDKGStarted(epochCounter)
assert.NoError(t, err)
t.Run("-> flow.DKGStateStarted, should be allowed", func(t *testing.T) {
epochCounter++
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
epochCounter++

I'm guessing this was a previous strategy to ensure that all sub-tests use a different epoch counter. But currently each sub-test is calling setupState(), which uses a random epoch counter. Unless I'm missing something, I think we can remove.

})

t.Run("-> flow.DKGStateFailure, should be allowed", func(t *testing.T) {
epochCounter++
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
epochCounter++

epochCounter := setupState()

actualState, err := store.GetDKGState(epochCounter)
require.NoError(t, err, storage.ErrNotFound)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
require.NoError(t, err, storage.ErrNotFound)
require.NoError(t, err)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants